Master’s Thesis in Computer Science: Recognizing artist ambiguity with machine learning techniques
Abstract
Spotify is a music streaming service whose ambition is to offer everyone easy access to all of the world’s music. Maintaining metadata quality is a nontrivial challenge: the catalogue is too large for manual curation, and content arrives from many providers in many different delivery formats. One interesting metadata problem is keeping a one-to-one mapping between real-world artists and Spotify artists: two different artists should be two separate entities on Spotify. Currently, when several albums by artists with the same name are sent to the service, they are placed under the same artist identifier on the assumption that they belong together. This becomes an issue when two distinct artists with the same name have albums on Spotify: all the albums are grouped together, so users assume they belong to the same artist, and fans have a poor experience when their favorite artist’s releases are cluttered with unrelated ones. This thesis explores the use of machine learning techniques to detect ambiguous artists, that is, Spotify artists credited with albums that actually belong to several real-world artists. Several features for representing the artists are presented, such as the existence of multiple matches between the Spotify artist and external, curated music databases, and the number of countries in which the artist has registered recordings. The examples are classified with Naive Bayes and logistic regression using every possible combination of the features. Two of the best-performing resulting classifiers with low false positive rates are then queried with unseen data sets of random artists and of the most popular artists to assess their predictions. We found that the most useful features for the classification were the existence of multiple matches between the Spotify artist and the external music databases, the length of the artist’s name, and the number of languages used in the artist’s track names. Logistic regression proved superior to Naive Bayes on a test set of random artists. Once the classifier has detected the ambiguous artists and they have been separated manually, heuristics can be put in place so that incoming albums that could belong to more than one artist are assigned to the most likely one. This practical solution to the artist ambiguity problem is also briefly discussed.
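The evaluation loop described above (train a classifier on every feature combination, then compare false positive rates) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the feature names, the binary encoding of the features, the toy rows, and the hand-written Bernoulli Naive Bayes are all hypothetical, and the model is scored on its own training rows only to keep the sketch short.

```python
from itertools import combinations
from math import log

# Hypothetical binary features, loosely named after those in the abstract.
FEATURES = ("multiple_external_matches", "short_name", "many_languages")


def feature_subsets(features):
    """Yield every non-empty combination of the given features."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def train_naive_bayes(rows, labels, subset):
    """Fit a Bernoulli Naive Bayes model with add-one smoothing."""
    model = {}
    for cls in (0, 1):
        members = [r for r, y in zip(rows, labels) if y == cls]
        prior = (len(members) + 1) / (len(rows) + 2)
        likelihood = {
            f: (sum(r[f] for r in members) + 1) / (len(members) + 2)
            for f in subset
        }
        model[cls] = (prior, likelihood)
    return model


def predict(model, row, subset):
    """Return 1 (ambiguous) if that class has the higher log-posterior."""
    scores = {}
    for cls, (prior, likelihood) in model.items():
        score = log(prior)
        for f in subset:
            p = likelihood[f]
            score += log(p if row[f] else 1 - p)
        scores[cls] = score
    return 1 if scores[1] > scores[0] else 0


def false_positive_rate(predictions, labels):
    """Share of unambiguous artists (label 0) flagged as ambiguous."""
    negatives = [p for p, y in zip(predictions, labels) if y == 0]
    return sum(negatives) / len(negatives) if negatives else 0.0


# Made-up toy examples for illustration only: label 1 means "ambiguous".
rows = [
    {"multiple_external_matches": 1, "short_name": 1, "many_languages": 1},
    {"multiple_external_matches": 1, "short_name": 0, "many_languages": 1},
    {"multiple_external_matches": 0, "short_name": 1, "many_languages": 0},
    {"multiple_external_matches": 0, "short_name": 0, "many_languages": 0},
]
labels = [1, 1, 0, 0]

# Score every feature combination by its false positive rate.
results = {}
for subset in feature_subsets(FEATURES):
    model = train_naive_bayes(rows, labels, subset)
    predictions = [predict(model, r, subset) for r in rows]
    results[subset] = false_positive_rate(predictions, labels)
```

With n features the loop trains 2^n - 1 models, so exhaustive search of this kind is only practical for a small feature set. A real evaluation would also score each model on held-out artists rather than on the training rows.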